Knowledge-based Wrapper Generation by Using XML

Authors

  • Heekyoung Seo
  • Jaeyoung Yang
  • Joongmin Choi
Abstract

Information extraction is the process of recognizing the particular fragments of a document that constitute its core semantic content. However, most previous information extraction systems were not effective for real-world information sources, because useful domain knowledge is difficult to acquire and represent and because different sources are structurally heterogeneous. To resolve these problems, this paper proposes a scheme of knowledge-based wrapper generation for semi-structured and labeled documents. The implementation of an agent-oriented information extraction system, XTROS, is described. In contrast with previous wrapper learning agents, XTROS represents both the domain knowledge and the wrappers as XML documents to increase modularity, flexibility, and interoperability among multiple parties. XTROS also simplifies the implementation of the wrapper generator by exploiting XML parsers and interpreters. XTROS shows good performance on several Web sites in the real estate domain, and it is expected to be easily adaptable to different domains by plugging in appropriate XML-based domain knowledge.
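
To make the idea of XML-based domain knowledge concrete, here is a minimal sketch in Python. It is not the XTROS format or algorithm, which the abstract does not specify: the element names (knowledge, object, label), the sample listing, and the helper functions load_labels and extract are all hypothetical, chosen only to show how pluggable XML domain knowledge could drive extraction from a labeled, semi-structured record.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical domain knowledge for real estate listings.
# The schema (knowledge/object/label) is an assumption for illustration,
# not the format used by XTROS.
DOMAIN_KNOWLEDGE_XML = """
<knowledge domain="real-estate">
  <object name="price"><label>Price</label></object>
  <object name="bedrooms"><label>Beds</label><label>Bedrooms</label></object>
  <object name="bathrooms"><label>Baths</label><label>Bathrooms</label></object>
</knowledge>
"""


def load_labels(xml_text):
    """Map each lower-cased label term to the object name it denotes."""
    root = ET.fromstring(xml_text)
    labels = {}
    for obj in root.findall("object"):
        for label in obj.findall("label"):
            labels[label.text.strip().lower()] = obj.get("name")
    return labels


def extract(record, labels):
    """Pull 'Label: value' pairs out of a '|'-separated record.

    Only labels known to the domain knowledge are kept; others are ignored.
    """
    result = {}
    for label, value in re.findall(r"([A-Za-z]+)\s*:\s*([^|]+)", record):
        name = labels.get(label.lower())
        if name is not None:
            result[name] = value.strip()
    return result


labels = load_labels(DOMAIN_KNOWLEDGE_XML)
listing = "Price: $250,000 | Beds: 3 | Baths: 2 | Agent: J. Doe"
extracted = extract(listing, labels)
```

Under these assumptions, switching to another domain would mean supplying a different XML document to load_labels while leaving extract unchanged, which is the kind of adaptability the abstract describes.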

Similar resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources

The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or application...

An XML-enabled data extraction toolkit for web sources

The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or applicat...

XML-Enabled Data Extraction for Web Sources

The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or application...

Java-COM integration with JACOB using XML wrappers

Many Windows-based legacy applications can be programmatically accessed using COM interfaces. However, calling COM components from Java is not straightforward. This report compares four open source Java-COM integration packages. A technique for typesafe Java-COM integration is presented. The technique is based on typesafe COM interface wrappers using jcom, java2com and JACOB libraries. Examples ...

Publication date: 2001